perm filename RESULT[DIS,DBL]5 blob sn#219335 filedate 1976-06-12 generic text, type C, neo UTF8
COMMENT ⊗   VALID 00011 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00002 00002	.NSECP(Results)
C00004 00003	.SSEC(What AM Did)
C00006 00004	. SSSEC(AM as a Mathematician)
C00007 00005	. SSSEC(AM as a Computer Program)
C00011 00006	.SSEC(Experiments with AM)
C00021 00007	. SSSEC(Must the Worth numbers be finely tuned?)
C00031 00008	. SSSEC(How finely tuned is the Agenda?)
C00040 00009	. SSSEC(What if certain key concepts are eliminated?)
C00042 00010	. SSSEC(What if certain heuristics are tampered with?)
C00043 00011	. SSSEC(Can AM work in a new domainα: Plane Geometry?)
C00045 ENDMK
C⊗;
.NSECP(Results)

.R1PAGE: PAGE;

This chapter opens  by summarizing what  AM "did". Section 1  gives a
fairly high-level description of the major paths which were explored,
the concepts discovered along  the way, the relationships which  were
noticed, and the ones which "should" have been noticed but weren't.

The next section continues this  exposition by presenting the results
of experiments which were done with (and ⊗4on⊗*) AM.

.ONCE TURN ON "{}"

Chapter {[2] EVALU} will draw upon these results -- and others given
in the appendices -- to form conclusions about AM. Several meta-level
questions will be tackled there (e.g., What are AM's limitations?).

.SSEC(What AM Did)

<<This contains duplicate info. to that given in Chap. 1: AM as mathematician>

AM is both a mathematician of sorts, and a big computer program.

.BEGIN TURN ON "{}"

By granting  AM more anthropomorphic  qualities than it  deserves, we
can  describe  its  progress  through  elementary  mathematics.    It
rediscovered many well-known concepts, a couple of interesting but
not-generally-known ones,  and several  concepts which  were hitherto
unknown    and   should    have   stayed    that   way.       Section
{SECNUM}.{SSECNUM}.1 recaps what  AM did, much  as a historian  might
critically evaluate Euler's work.


By under-estimating AM's sophistication, one can demand answers to
the questions typically asked about a computer program: how big is it,
how much cpu time does it use, what is the input/output, etc. These
are treated in Section {SECNUM}.{SSECNUM}.2.

.END

. SSSEC(AM as a Mathematician)

<<See p.11 OVERV -- chap 1 -- for updated version. Why repeat?>

. SSSEC(AM as a Computer Program)

<<Is this section worth including? Change of format, perhaps?>

Viewed as a large LISP program, AM holds very little of interest.
There is the usual battery of customized functions (e.g., a conditional
PRINT function), the storage hacks (special emergency garbage-collection
routines, which know which facets are expendable), the time hacks
(omnisciently arranging clauses in a conjunction so that the one most
likely to fail comes first), and the bugs (if the user renames a concept
while it is the current one being worked on,
there is a 5% chance of AM entering an infinite loop).

Below are listed a few parameters of the system, although I doubt that
they hold any theoretical significance. The reader may be curious about
how big AM is, how long it takes to run, etc.



Machine: SUMEX, PDP-10, KI-10 uniprocessor, 256k core memory.

Language: Interlisp, January '75 release, 
which occupies 140k, but which provides a surplus  "shadow space"
of 256k additional words available for holding compiled code.

AM support code: 
200 compiled (not block-compiled) utility routines, control routines, etc.
These occupy roughly 100k, but all are pushed into the shadow space.

AM itself: 115 concepts, each occupying about 1k. Facet/entry pairs are stored
as property/value pairs on the property lists of atoms whose names are the concepts' names.
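
For concreteness, the following fragment (illustrative only, not AM's
actual code) shows how a facet/entry pair might be stored and fetched
with the standard Interlisp property-list primitives; the facet name and
the sample entries are made up:

.BEGIN NOFILL INDENT 6

(PUTPROP 'SETS 'EXAMPLES '(NIL (A) (A B)))
(GETPROP 'SETS 'EXAMPLES)    ==>    (NIL (A) (A B))

.END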

Snazzy feature:
Executable entries on facets  (e.g., an entry on Union.Alg) are stored uncompiled
until the first time they are actually called on, at which time they are compiled
and then executed.
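
The following sketch suggests the mechanism (again illustrative, not AM's
actual code). It assumes each executable entry is named by an atom, that
Interlisp's EXPRP tests whether that atom still has an interpreted
definition, and that COMPILE compiles a list of function names in core:

.BEGIN NOFILL INDENT 6

(DEFINEQ (RUN-ENTRY (LAMBDA (FN ARGS)
    (COND ((EXPRP FN)
           (* First call ever: FN is still interpreted, so compile it)
           (COMPILE (LIST FN))))
    (APPLY FN ARGS]

.END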

"Final" state of AM: 300 concepts, each occupying about 1k. Many are swapped out
onto disk. Original 115 concepts have grown to an average size of 2k.

Cpu time (and number of tasks chosen) at the moment various discoveries were made:

.BEGIN FILL PREFACE 0 INDENT 10 TURN ON "\" TABS 30,45

Cardinality:\15 minutes;\40 tasks

Unique factorization:\25 minutes;\60 tasks

<Maybe give a few more, intermediate ones>

.E

<<Other details about AM as a computer pgm>

.SKIP 1

.SSEC(Experiments with AM)

.EXPT: SECNUM;

.EXPTSSEC: SSECNUM;

.EXPTPAGE: PAGE;

We have now described the activities AM carried out during its best
run. AM was working by itself, each time executing the top task
on the agenda. It received no help from the user, and all of its
concepts' Intuitions facets had been removed.

One valuable aspect of AM is that it is amenable to many kinds of
interesting experiments.  Although AM is too ⊗4ad hoc⊗* for numerical
results to have much significance, the qualitative results perhaps do
have  some   valid  things  to  say  about   research  in  elementary
mathematics, about  automating  research,  and  at  least  about  the
efficacy of various parts of AM's design.

This section will explain  what it means to perform  an experiment on
AM,  what kinds  of experiments  are imaginable,  which of  those are
feasible, and  finally will  describe the  5  experiments which  were
performed on AM.

By modifying AM in various ways, its behavior can be altered, and the
⊗4quality⊗*  of  its  behavior  will change  as  well.  As  a drastic
example, if all the heuristics were removed, then AM  would just "sit
there"  -- with  no  plausible generators,  its  agenda would  remain
empty.

By careful planning, each experiment can tell us something new  about
AM: how  valuable a  certain piece  of it  is, how  robust a  certain
scheme really is, etc. The results of these experiments would then
have  something  to   contribute  to  a   discussion  of  the   "real
intelligence"  of  AM  (e.g.,  what features  were  superfluous),  and
contribute to  the design of the "next" AM-like system.  Generalizing
from those  results,  one might  suggest  some hypotheses  about  the
larger task of automated math research.

Let's cover the different  ⊗4kinds⊗* of experiments one could perform
on AM:

(i) Remove individual  concept modules,  and/or individual  heuristic
rules. Then  examine how  AM's performance  is degraded.   AM  should
operate even  if most of its heuristic rules  and most of its concept
modules were excised, so  long as ⊗4any⊗*  facets of  any of the  concepts
still exist.  If the remaining  fragment of AM is too small, however,
it may  not be able to find anything interesting to do. In fact, this
situation was actually encountered experimentally, when the first few
partially complete concepts were inserted. If only a little bit of AM
is removed, the remainder  will in fact  keep operating without  this
"uninteresting collapse".   The converse situation should  also hold:
although  still functional  with any  concept module  unplugged, AM's
performance ⊗4should⊗* be noticeably degraded. That is, while not
indispensable, each concept should nontrivially help the others. The
same  holds for each individual heuristic  rule.  Which concepts does
AM now  "miss" discovering?  Is the  removed concept/heuristic  later
discovered  anyway  by those  which  are left  in  AM?   This  should
indicate the importance  of each kind  of concept and  rule which  AM
starts with.

(ii) Vary  the relative  weights given  to features  by the  criteria
which  judge aesthetics, interestingness,  worth, utility,  etc.  See
how important each factor is in directing AM along successful routes.
In other  words, vary the  little numbers  in the formulae  (both the
global  formulae and  the local  ones inside  heuristic rules).   One
important result will be  some idea of the robustness  or "toughness"
of the numeric weighting  factors. If the system easily collapses, it
was too finely tuned to begin with.

(iii) Add several new concept modules (including new heuristics)  and
see if AM can  work in some unanticipated field  of mathematics (like
graph theory or calculus or plane geometry). Do earlier achievements
-- new  concepts and  relationships involving  concepts  -- have  any
impact in  the new domain?  Are some specialized heuristics  from the
first  domain totally wrong  here? Do all the  old general heuristics
still  hold  here?  Are  they  sufficient,  or   are  some  "general"
heuristics  needed here which  weren't needed  before? Does  AM "slow
down" as more and more concepts get introduced?

(iv) Try to have AM 
develop nonmathematical theories  (like elementary physics, or program
verification).  This  might  require  limiting  AM's  freedom to
"ignore  a  given  body  of  data  and  move  on  to  something  more
interesting". The exploration of  very non-formalizable fields (e.g.,
politics) might require much more than a small augmentation of AM's
base of concepts.

(v) Add several  new concepts dealing with  proof, and of course  add
all the associated heuristic rules. Such rules would advise AM on the
fine points of  using various techniques  of proof/disproof: when  to
use them, what to try next based on why the last attempt failed, etc.
See if the ⊗4kinds⊗* of discoveries AM makes are increased.


During  the few weeks
prior to the writing of this document,
several experiments (of  types i, ii, and
iii above$$  experiments  of type  (iv)  weren't tried and are left as
"open problems", as invitations for future research efforts.
Experiment (v) will probably occupy the summer of 1976. $) have been
set up and performed on  AM. We're now ready to examine each  of them
in detail.  The following points are covered for each experiment:

.BN

λλ How was it thought of? 

λλ What will be gained by  it?  What would be the implications of the
various possible outcomes?

λλ How was the experiment set up? What preparations/modifications had
to be made? How much time (man-hours) did it take?

λλ What happened?  How did AM's  behavior change? Was  this expected?
Analysis.

λλ  What was learned from  this experiment? Can  we conclude anything
which suggests new  experiments (e.g.,  use a better  machine, a  new
domain) or which bears on a  more general problem that AM faced (e.g.,
a new way to teach math, a new idea about doing math research)?

.E

. SSSEC(Must the Worth numbers be finely tuned?)

.EX3PAGE: PAGE;

Each of the 115 initial concepts
was supplied (by the author) with a number between 0
and 1000, stored as its Worth facet, which is supposed to represent
the overall value of the concept. "Compose" has a higher initial
Worth than "Structure-delete", which is higher than "Equality"$$ As AM
progresses, it notices something interesting about Equality every now
and then, and pushes its Worth value upwards. $.

Frequently, the priority of a task involving C depends on the overall
Worth of C. 
How  sensitive is
AM's  behavior to  the initial  settings of  the Worth  facets?   How
finely tuned must these initial Worth values be?

To test this, a simple experiment was performed. Just before starting
AM, the mean value of all concepts' Worth values was computed (around
200). Then  each concept had its Worth reset to  the value 200.  This
was done "by hand", by  the author, in a  matter of seconds.  AM  was
then started and run as if  there were nothing amiss, and its behavior
was watched carefully.

What happened?  By and large, the same major discoveries were made --
and missed -- as usual, in  the same order as usual.  But  whereas AM
proceeded fairly smoothly before, with little superfluous activity, it
now wandered quite blindly for long periods of time, especially at
the very beginning. Once AM "hooked into" a line of productive
development, it followed it just as always, with no noticeable
additional wanderings. As one of these lines of development died
out, AM would wander around again, until the next one was begun.

It took roughly three times as long as usual for each major discovery
to occur. This "delay" got shorter and shorter as AM developed
further.   In  each  case,  the  tasks preceding  the  discovery  and
following  it were pretty  much the  same as  normal; only  the tasks
"between" two periods of development were different -- and much  more
numerous.  The  precise  numbers  involved  would  probably  be  more
misleading than  helpful, so they will not  be given$$ Any reader who
wishes to  perform  this experiment  can  simply say  [MAPC  CONCEPTS
'(LAMBDA  (c) (SETB  c WORTH  200] to  Interlisp, just  before typing
(START) to begin AM. $.

The  reader may be interested to learn  that the Worth values of many
of the  concepts -- and  most of the  new concepts  -- ended up  very
close  to the same  values that  they achieved  in the  original run.
Overrated concepts were  investigated and  proved boring;  underrated
concepts had  to wait longer  for their  chances, but then quickly  proved
interesting and had their Worth facets boosted.

The conclusion I  draw from this change in behavior is that the Worth
facets are useful for making blind decisions -- where AM  must choose
based  only on  the overall  worths of  the various  concepts in  its
repertoire.  Whenever  a specific  reason  existed, it  was  far more
influential than  the "erroneous"  Worth values.   The close,  blind,
random decisions occur  between long bursts of specific-reason-driven
periods of creative work.

The general answer, then, is ⊗4No⊗*, the initial settings of the Worth
values are not crucial. Guessing reasonable initial values for them is
merely a time-saving device.$$
This suggests
an interesting research problem: what impact does
the quality of initial starting values have on humans? Give several bright
undergraduate math majors the same set of objects and operators to play with,
but (i) tell some of them nothing, (ii) tell others that a certain few pieces of
the system are very promising, and (iii) for still others,
emphasize a different subset of the objects and operators.
How does "misinformation" impede the humans? How about no information?
Have them give verbal protocols about where they are focusing their attention,
and why. $

Albeit at a nontrivial cost, the Worth facets did manage to correct
themselves  by the  end of  a long$$  A couple  cpu hours,  about a
thousand tasks  total selected from  the agenda  $ run.   What  would
happen  if the  Worth  facets of  those 110  concepts  were not  only
initialized  to 200, but were  held fixed at 200  for the duration of
the run?

In this  case, the  delay factor still decreased.  That is,  AM
still got more and more "back to normal" as it progressed. The reason
is that AM's later work dealt with concepts like Primes,
Square-root, etc., which were so far removed  from the initial base of
concepts that the initial concepts' Worths were of little consequence.

Even  more drastically, we  could force all  the Worth  facets of all
concepts -- even newly-created  ones -- to be  kept at the value  200
forever. In this case,  AM's behavior doesn't completely disintegrate,
but  that delay factor  actually increases with  time: apparently, AM
begins to suffer from the exponential growth of "things to do" as its
repertoire  of  concepts  grows  linearly.   Its  purposiveness,  its
directionality depends on "focus of attention" more and more, and  if
that feature is removed,  AM loses much of its rationality.   A factor
of 5 delay doesn't sound that bad "efficiency-wise", but the actual
behavior of AM appears as staccato bursts of development, followed by
wild leaps to unrelated concepts. AM can no longer "permanently"
record its interest in a certain concept.

So  we conclude that the  Worth facets are (i)  not finely tuned, yet
(ii) provide important  global information about the  relative values
of  concepts.   If  the  Worth facets  are  completely disabled,  the
rationality of AM's behavior hangs on the slender thread of "focus of
attention".

. SSSEC(How finely tuned is the Agenda?)

.RTEXSSSEC: SSSECNUM;

.COMMENT This assumes that expt number = sssec number;

.RTEXNO: SSSECNUM; 

The top few candidates  on the agenda always appear  to be reasonable
(to me).   If I work with  the system, guiding it, I  can cause it to
make a few discoveries it wouldn't otherwise make, and I can cause it
to make its typical ones much faster  (about a factor of 2). Thus the
⊗4very⊗* top task is not always the best.

If  AM randomly selects one of  the top 20 or  so tasks on the agenda
each time, what  will happen to  its behavior? Will it  disintegrate,
slow down by a factor of 10, slow down slightly,...?

This experiment required only a few seconds to set up, but demanded a
familiarity with  the  LISP  functions which  make  up  AM's  control
structure.   At  a  certain  point,  AM asks  for  Best-task(Agenda).
Typically,  the LISP function  Best-task is  defined as CAR  -- i.e.,
pick the  first member from  the list  of tasks.  What I  did was  to
redefine Best-task as  a function which randomly selected  n from the
set  {1,2,...,20},  and  then  returned  the  n⊗Ath⊗*  member of  the
job-list.
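
Concretely, the redefinition was essentially a one-liner. Here is a
sketch of it, assuming Interlisp's RAND (which returns a random integer
in the given range) and using (CAR (NTH L n)) to fetch the n⊗Ath⊗*
element of a list; the case of an agenda shorter than 20 tasks is
glossed over:

.BEGIN NOFILL INDENT 6

(DEFINEQ (BEST-TASK (LAMBDA (AGENDA)
    (CAR (NTH AGENDA (RAND 1 20]

.END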

If you watch the top job on the agenda, it will  take about 10 cycles
until  AM  chooses it.  And  yet there  are  many good,  interesting,
worthwhile jobs sprinkled  among the top  20 on the  agenda, so  AM's
performance is cut by merely a factor of 3, as far as cpu time per
given major  discovery.  Part of this  better-than-20 behavior is due
to the fact  that the 18⊗Ath⊗*  best task had  a much lower  priority
rating than  the top few, hence was  allocated much less cpu  time for
its  quantum  than  the top  task  would have  received.    Whether it
succeeded or  failed, it  used up  very little  time.   Since AM  was
frequently  working on a  low-value task,  it was unwilling  to spend
much time or space on it. So  the mean time allotted per task fell  to
about 15 seconds (from the typical 30  secs). Thus, the "losers" were
dealt  with quickly,  so the  detriment  to cpu-time  performance was
softened.

Yet AM is much less rational in its sequencing of tasks. A topic will
be dropped  right in the middle,  for a dozen cycles,  then picked up
again.   Often a  "good" task will  be chosen, having  reasons all of
which were true  10 cycles ago --  and which are clearly  superior to
those  of the last  10 tasks. This  is what  is so annoying  to human
onlookers.

Conclusion: Picking (on the average) the 10th-best candidate  impedes
progress by  a factor  less  than 10  (about a  factor of  3), but  it
dramatically   degrades  the  "sensibleness"  of   AM's  behavior,  the
continuity of  its actions.   Humans place  a big  value on  absolute
sensibleness, and believe that doing something silly 50% of the time
is ⊗4↓_much_↓⊗* worse than merely being half as productive as someone
who always does the next most logical task.

Corollary: Having 20 multi-processors simultaneously execute the top
20 jobs will increase  the rate of "big" discoveries, but not by a full
factor of 20.

Another experiment in this same vein was done, one which was designed
to be far more  crippling to AM. Be-threshhold was held  at 0 always,
so  ⊗4any⊗*  task which  ever got  proposed  was kept  forever  on the
agenda, no matter  how low its priority.  The Best-task function  was
modified so it randomly selected any  member of the list of jobs. As
a final insult, the Worth facets of all the concepts were initialized
to 200 before starting AM.
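
In code, the three insults together amount to something like this (a
sketch, reusing the hypothetical BEST-TASK redefinition above and the
MAPC call shown in an earlier footnote; BE-THRESHHOLD is assumed to be
a global variable):

.BEGIN NOFILL INDENT 6

(SETQ BE-THRESHHOLD 0)
(DEFINEQ (BEST-TASK (LAMBDA (AGENDA)
    (CAR (NTH AGENDA (RAND 1 (LENGTH AGENDA]
[MAPC CONCEPTS '(LAMBDA (c) (SETB c WORTH 200]

.END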

Result: Many  "explosive" tasks were  chosen, and  the number of  new
concepts  increased rapidly.  As expected,  most of  these  were real
"losers".  There seemed no  rationality to AM's sequence of  actions,
and it  was  quite boring to  watch  it floundering  so. The  typical
length of the agenda was about 500, and AM's performance was "slowed"
by at least a couple  orders of magnitude. A more subjective  measure
of its "intelligence" would say  that it totally collapsed under this
random scheme.

Conclusion: Having an unlimited number of processors simultaneously 
execute all the jobs
on the agenda would increase the rate at which AM made big discoveries, at an
ever accelerating pace (since the length of the agenda would grow exponentially).

Having a uniprocessor ⊗4simulate⊗* such parallel processing would be a losing idea,
however. The truly "intelligent"  behavior AM exhibits  is its plausible
sequencing of tasks.

Let's dig  inside the  agenda scheme  now. One  of the  highly-touted
ideas in AM is the  attaching of reasons to the tasks  on the agenda,
and  using those  reasons and  their ratings  to compute  the overall
priority value assigned to each task. An experiment was done to
ascertain the  amount of  intelligence that  was emanating  from that
idea.

The  global  formula assigning  a  priority  value to  each  job was
modified. We let it still  be a function of the reasons for  the job,
but we "trivialized" it: the priority of a job was computed as simply
the number of reasons it has.  (normalized by multiplying by 100, and
cut-off if over 1000).
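
In other words, the formula was reduced to something like the following
sketch, where REASONS-OF is a hypothetical accessor fetching the list of
reasons attached to a job:

.BEGIN NOFILL INDENT 6

(DEFINEQ (PRIORITY-VALUE (LAMBDA (JOB)
    (MIN 1000 (TIMES 100 (LENGTH (REASONS-OF JOB]

.END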

This raised the new question  of what to do if several  jobs all have
the  same  priority. I  suppose  the  answer is  to  execute  them in
stack-order (most recent  first)$$ Why? Because  (i) it sounds  right
intuitively  to me,  
(ii) this is akin to human focus of attention,
and  mainly because  (iii) this  is  what AM  did
anyway, with no extra modification. $.

Result: <<This experiment hasn't been performed yet!>

Conclusion:

. SSSEC(What if certain key concepts are eliminated?)


Being in a perverse mood one day, I eliminated the concept
"Equality" from AM, to see what it would then do.  Equality was a key
concept, because  AM  discovered  Numbers via  the  technique  of
generalizing  the  relation "Equality"  (exact  equality  of 2  given
structures, at all internal levels). What would happen if we
eliminated this path? Will AM rederive Equality? Will it get to
Cardinality via another route? Will it do some set-theoretic things?

Result: <<This expt (elim. equality) is not yet done!>

Similar experiments: eliminate various other specific concepts, add a
few specific ones, modify some.

. SSSEC(What if certain heuristics are tampered with?)

add/delete/modify certain heuristics

not done yet

. SSSEC(Can AM work in a new domainα: Plane Geometry?)

This big experiment has already been done, but needs to be written up
nicely.

The author  added a new base  of concepts to the  ones already in AM.
Included  were:   Point,   Line,   Angle,   Triangle,   Equality   of
pts/lines/angles/triangles.

Results:  fairly good  behavior.  Derived  congruence, similarity  of
triangles.  Derived the idea of timberline in several ways.

A use was found for Goldbach's conjecture: any angle (0-180 degrees) can be built
up (to within  1 degree) as  the sum of  two angles of prime  degrees
(<180).
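
(For instance, a right angle: 90 = 43 + 47, and both 43 and 47 are prime.)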

.GEOEX: PAGE;